ABSTRACT

The Durham adaptive optics real-time controller is a generic, high performance real- 
time control system for astronomical adaptive optics systems. It has recently had new 
features added as well as performance improvements, and here we give details of these, 
as well as ways in which optimisations can be made for specific adaptive optics systems 
and hardware implementations. We also present new measurements that show how this 
real-time control system could be used with any existing adaptive optics system, and 
also show that when used with modern hardware, it has high enough performance to 
be used with most Extremely Large Telescope adaptive optics systems. 

Key words: Instrumentation: adaptive optics, techniques: image processing, instru- 
mentation: high angular resolution 
1 INTRODUCTION



Adaptive optics (AO) (Babcock 1953) is a technique for mitigating
the degrading effects of atmospheric turbulence on
the image quality of ground-based optical and near-IR tele- 
scopes. It is critical to the high angular resolution perfor- 
mance of the next generation of Extremely Large Telescope 
(ELT) facilities, which will have primary mirror diameters 
of up to 40 m. Without mitigation, the general spatial res- 
olution of such a telescope would be subject to the same 
atmospheric limitations as a 0.5 m diameter telescope. The 
proposed ELTs represent a large strategic investment and 
their successful operation depends on having a range of 
high performance heterogeneous AO systems. As such, these 
telescopes will be the premium ground based optical and 
near-IR astronomical facilities for the next two decades. The 
ELTs will, however, require a very significant extrapolation 
of the AO technologies currently deployed or under develop- 
ment for existing 4-10 m telescopes. Not least amongst the 
required developments of current AO technology is the area 
of real-time control. In this paper we describe the develop- 
ment and testing of a real-time controller, with the required 
scalability. 

The Durham adaptive optics real-time controller
(DARC) (Basden et al. 2010) is a real-time control system
for AO which was initially developed to be used with
the CANARY on-sky multi-object AO (MOAO) technology
demonstrator (Gendron et al. 2011). As such, it was
a significant success, being stable, configurable and powerful,
and able to meet all the needs for this AO system.
There was also demand for DARC to be used with other
instruments, and so a further improved version of DARC
was released to the public using an open source GNU-GPL
(http://www.gnu.org/copyleft/gpl.html 2007) license. Here,
we provide information about the DARC platform, including 
new features, architectural changes, modularisation, perfor- 
mance estimates, algorithm implementation and generali- 
sations. We have tested the performance of this system in 
configurations matching a wide range of proposed ELT AO 
systems, and also configured to match proposed high order 
8 m class telescope AO systems, and a selection of results 
are presented here. 

To date, each AO system commissioned on a telescope 
or used in a laboratory has generally had its own real-time 
control system, leading to much duplicated effort. So far as 
we are aware, the only other multi-use high performance real-time
control system for AO is the European Southern Observatory
(ESO) Standard Platform for Advanced Real-Time
Applications (SPARTA) (Fedrigo et al. 2006). Like DARC,
SPARTA is designed to support heterogeneous components
in high performance configurations, including computational
hardware other than standard PCs, and is currently being
integrated with second generation Very Large Telescope
(VLT) instruments such as SPHERE (Fusco et al. 2006),
GRAAL (Paufique et al. 2010) and GALACSI (Stuik et al.
2006), though it is not used outside ESO. The full SPARTA
system cannot be used with just standard PCs, and so expert
programming skills and dedicated software maintenance are
required, making it unsuitable and costly
for simple laboratory setups. A solution to this is to develop
a system that is flexible enough to satisfy most performance 
requirements, is able to meet challenging AO system specifications
using standard PC hardware, is simple to set up
and use, and also supports hardware acceleration so that it 
is powerful enough to be used with demanding applications 
on-sky. DARC has been designed with these requirements in 
mind, and here we seek to demonstrate how it can be suited 
for most AO systems. A common AO real-time control sys- 
tem of this kind would be beneficial for the AO community, 
leading to a reduction in learning time, and increased system 
familiarisation. 

This central processing unit (CPU)-based approach 
to AO real-time control has not been successful in the 
past because it has been deemed that previously available 
CPUs have not been sufficiently powerful to meet the demanding
performance requirements of on-sky AO systems
(Fedrigo et al. 2006). The advent of multi-core processors,
which started to become commonly available from the mid-2000s,
has been a key enabling factor for allowing our approach
to succeed (our development commenced in 2008).
It is now possible to obtain standard PC hardware with 
enough CPU cores to not only perform the essential real- 
time pipeline calculations (from wavefront sensor data to 
deformable mirror (DM) commands), but also to perform 
necessary sub-tasks, such as parameter control, configura- 
tion, and sharing of real-time system information and diag- 
nostic streams. Attempts at using a single core CPU for all 
these tasks have generally failed because context switching 
between these tasks has led to unacceptable jitter in the AO 
system. However, multi-core systems do not suffer so much 
from this unpredictability. Previous systems have also not 
been freely available and have been closed source, which has 
greatly hindered uptake particularly for laboratory bench 
based systems. Our approach does not have these restric- 
tions. 

Commercially available offerings, although impressive
in many respects, do not scale well for use with future high
order AO system designs. They are often restricted to specific
hardware, are designed for laboratory systems, and can lack
features required for high order on-sky AO systems, such as
pipe-lined pixel-stream processing.

There are three main areas of application for DARC:
as a laboratory AO real-time control system (RTCS), where
flexibility and modularity are key, as well as stability; as a
control system for instruments on 8-10 m class telescopes,
where it is being evaluated for use with a number of AO
systems currently under development; and finally, as a control
system for ELT instruments, where it is currently a potential
candidate for use with two proposed AO systems.

In §2, we give an overview of the new features, including 
advanced algorithms, modularisation, diagnostic data han- 
dling and tools available with DARC. In §3, we discuss how 
DARC can be optimised for use with a given AO system, 
and present some results demonstrating this optimisation. 
In §4, we provide some examples of how DARC can be used 
with some existing and proposed demanding AO systems, 
and demonstrate the hardware that would be required for 
such operation. Finally, in §5 we draw our conclusions. 



2 DARC FEATURES AND ALGORITHMS 

There are several important architectural changes that
have been made between the original version of DARC
(Basden et al. 2010), as used on-sky with CANARY, and
the current freely available version (to be used with future
phases of CANARY), and these are discussed here.
The changes include improved and expanded modularisa- 
tion, changes to diagnostic data handling, improvements to 
graphical processing unit (GPU) acceleration, the ability to 
be used asynchronously with multi-rate cameras, a generali- 
sation of pixel handling, advanced spot tracking algorithms, 
improved command line tools and the ability to use DARC 
in a massively parallelised fashion across multiple nodes in 
a computing grid. Hardware acceleration support has been 
improved principally through the increased modularisation 
of DARC, and overall performance has also been dramati- 
cally improved. 

DARC has also acquired the ability to allow user pa- 
rameter changes on a frame-by-frame basis, allowing more 
complete control and dynamic optimisation of the AO sys- 
tem. The control interface has added functionality that in- 
cludes parameter subscription and notification, and greater 
control of diagnostic data, including redirection, partial sub- 
scription and averaging. 

2.1 DARC modularisation 

Since conception, DARC has always had some degree of 
modularisation; it has been possible to change cameras, 
deformable mirrors and reconstruction algorithms by dynamically
loading and unloading modules into the real-time
pipeline, without restarting DARC. This modularisation has 
been extended to increase the degree of user customisation 
that is possible with DARC. Module interfaces that allow 
modules to be dynamically loaded and unloaded in DARC 
now additionally include image calibration and wavefront 
slope computation interfaces, and an asynchronous open- 
loop DM figure sensing interface, as well as the pre-existing 
wavefront reconstruction interface. A parameter buffer in- 
terface has also been added, allowing customisation of high- 
speed parameter input, facilitating the adjustment of any 
parameter on a frame-by-frame basis, thus allowing advanced
use of the system, for example DM modulation and
fast reference slope updating. 

The user application programming interface (API) for 
developing DARC modules has been rationalised and now 
most modules include similar functions, which are called at 
well defined points during the data-processing pipeline. This 
allows the developer to easily identify which functions are 
necessary for them to implement to achieve optimum AO 
loop performance, and encourages the consideration of al- 
gorithms that reduce latency, and improve real-time perfor- 
mance. 

Although it is possible to implement a large number of 
functions in each of these modules, typically they are not 
all required and so the developer should implement only the 
necessary subset for their particular application. 

2.1.1 Module hook points 

DARC uses a horizontal processing strategy, as described
by Basden et al. (2010), which splits computational load as
evenly as possible between available threads, thus allowing
good CPU load balancing and high CPU resource utilisation,
giving low latency performance. To achieve this, each
thread must be responsible for performing multiple algorithms,
including wavefront sensor (WFS) calibration, slope
computation and partial wavefront reconstruction (rather
than separate threads performing calibration, slope computation
and reconstruction). To reconcile this with the modular
nature of DARC, there are defined points at which
module functions are called as shown in Fig. 1. A module
developer then fills in the body of the module functions
that they require. DARC will then call these functions at
the appropriate time in a thread-safe way. Fig. 1 shows the
DARC threading structure and the points at which module
functions are called. It should be noted that the "Process"
function is called multiple times for each WFS frame until
all the data have been processed (i.e. called for each
sub-aperture in a Shack-Hartmann system). This approach has
been taken to encourage a module developer to consider how
their algorithm best fits into a low latency architecture, and
to provide a consistent interface between modules.
Unimplemented functions (those that are not required for a given
algorithm) are simply ignored.

Figure 1. A figure demonstrating the points (in time) at which
DARC module functions, if implemented, are called. Three WFS
frames are shown with time increasing from left to right, with
a new module being loaded at the start (point 1). This assumes
there are four main processing threads (Thread1, Thread2, Thread3
and Thread4), and the thread responsible for pre-processing and
post-processing is also shown ("End of frame thread"). The legend
shows the function names corresponding to the points (in time)
at which the functions are called, for example, (1) is the point at
which the module "Open" function is called, thus initialising the
module. The process function (5) is called repeatedly each frame
until the frame has been processed (for example, once for each
sub-aperture). Further explanation is given in the main text.

We make a distinction between processing threads (labelled
Thread1, Thread2 etc. in Fig. 1), which do the majority
of the parallelised workload, and the "end-of-frame"
thread, which is used to perform sequential workloads such
as sending commands to a deformable mirror. Upon initialisation
of a module, the module "Open" function is called
by a single processing thread. Here, any initialisation required
is performed, such as allocating necessary memory,
and (for a camera module) initialising cameras. Parameters
(such as a control matrix, or camera exposure time) are
passed to the module using the DARC parameter buffer. To
access this buffer, the "NewParam" function is then called
(if implemented). After this, the module is ready to use.
One processing thread calls a "NewFrameSync" function,
for per-frame initialisation. Each processing thread then
calls a "StartFrame" function, for per-thread, per-frame
initialisation. The "Process" function is then called multiple
times, while there is still data to be processed (for
a Shack-Hartmann system this is typically once per
sub-aperture, shared between the available processing threads).
Once all such processing has finished, each processing thread
calls an "EndFrame" function, which would typically perform
gather operations to collate the results (for example
summing together partial DM vectors). A single thread is
then chosen to call a "FrameFinishedSync" function to finalise
this frame. After this function has been called, the
"end-of-frame" thread springs into life, allowing the processing
threads to begin processing of the next frame, starting
with the "NewFrameSync" function. The "end-of-frame"
thread calls a "FrameFinished" function for finalisation during
which (for a mirror module) commands should be sent to
the DM. A "Complete" function is then called for each module,
which can be used for initialisation ready for the next
frame. The "end-of-frame" thread then calls a "NewFrame"
function. The "end-of-frame" thread is not synchronised
with the processing threads, and so there is no guarantee
when the "NewFrame" function is called relative to the functions
called by the processing threads, except that processing
threads will block before calling the "Process" function
until the "end-of-frame" thread has finished.

Whenever the DARC parameter buffer is updated, the 
"NewParam" function will be called by a single processing 
thread just before the "NewFrameSync" function is called. 
When the module is no longer required (for example when 
the user wishes to try a new algorithm available in another 
module), a "Close" function is called, to free resources. 

The large number of functions may seem confusing, 
particularly since some appear to have similar functional- 
ity. Fortunately, most modules need only implement a small 
sub-set of these functions. The full suite of functions have 
been made available to give a module developer the required 
flexibility to create a module that is as efficient as possible, 
minimising AO system latency. 
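
As an illustration only, the per-frame call sequence described above can be sketched in Python. This is a single-threaded trace of the call order, not the DARC implementation: the snake_case hook names, the `run_frame` driver and the toy `SumModule` are all assumptions introduced here, and real processing threads would run the per-thread steps concurrently.

```python
def call(module, hook, *args):
    """Invoke a hook if the module implements it; otherwise ignore it."""
    fn = getattr(module, hook, None)
    if fn is not None:
        fn(*args)


def run_frame(module, n_threads, n_subaps, params_changed=False):
    """Call the (hypothetical) module hooks for one frame, in order."""
    if params_changed:
        call(module, "new_param")              # one processing thread
    call(module, "new_frame_sync")             # one processing thread
    for t in range(n_threads):
        call(module, "start_frame", t)         # every processing thread
    for s in range(n_subaps):
        call(module, "process", s % n_threads, s)  # work shared out
    for t in range(n_threads):
        call(module, "end_frame", t)           # gather partial results
    call(module, "frame_finished_sync")        # one processing thread
    # The remaining hooks belong to the end-of-frame thread.
    call(module, "frame_finished")
    call(module, "complete")
    call(module, "new_frame")


class SumModule:
    """Toy module implementing only four of the hooks."""

    def __init__(self, n_threads):
        self.partial = [0] * n_threads
        self.total = 0
        self.sent = []

    def start_frame(self, t):
        self.partial[t] = 0

    def process(self, t, subap):
        self.partial[t] += subap

    def end_frame(self, t):
        self.total += self.partial[t]

    def frame_finished(self):
        self.sent.append(self.total)   # stands in for sending DM commands
        self.total = 0


module = SumModule(n_threads=4)
run_frame(module, n_threads=4, n_subaps=10)
```

The toy module leaves most hooks unimplemented, which the driver silently skips, mirroring the statement that unimplemented functions are simply ignored.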

Matrix operations are highly suited to this sort of hor- 
izontal processing strategy since they can usually be highly 
parallelised, and thus divided between the horizontal pro- 
cessing threads. Wavefront reconstruction using a standard 
matrix-vector multiplication algorithm (with a control ma- 
trix) is therefore ideally suited. Iterative wavefront recon- 
struction algorithms, for example those based around the 
conjugate gradient algorithms are less easy to parallelise, 
since each iteration step depends on the previous step. How- 
ever, a horizontal processing strategy does allow the first 
iteration to be highly parallelised, which can lead to significant
performance improvements when the number of iterations
is small, for example when using appropriate preconditioners
such as those used in the fractal iterative method
(Béchet et al. 2006). The post-processing (end of frame)
thread (or threads) can then be used to compute the remain- 
ing iterations. Similarly, any system that requires multiple 
step reconstruction, for example a projection between two 
vector spaces, such as for true modal control, can be easily 
integrated. 
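
A minimal sketch of how a matrix-vector reconstruction divides between horizontal processing threads is given below. It is not taken from DARC: the function names and the interleaved assignment of slopes to threads are assumptions, and a sequential loop stands in for the threads. Each "thread" forms a partial DM vector from its own slopes, and a final gather step sums the partial vectors.

```python
def partial_dm(control_matrix, slopes, columns):
    """Partial DM vector using only the given slope indices."""
    out = [0.0] * len(control_matrix)
    for a, row in enumerate(control_matrix):
        for c in columns:
            out[a] += row[c] * slopes[c]
    return out


def reconstruct(control_matrix, slopes, n_threads):
    """Split the slopes between n_threads workers, then sum the partials."""
    n_slopes = len(slopes)
    # Assumed interleaved split: worker t handles slopes t, t+n_threads, ...
    chunks = [range(t, n_slopes, n_threads) for t in range(n_threads)]
    partials = [partial_dm(control_matrix, slopes, ch) for ch in chunks]
    # Gather step: element-wise sum of the partial DM vectors.
    n_act = len(control_matrix)
    return [sum(p[a] for p in partials) for a in range(n_act)]


control_matrix = [[1, 2, 3, 4],
                  [0, 1, 0, 1]]   # 2 actuators, 4 slopes
slopes = [1, 1, 2, 2]
dm_two_threads = reconstruct(control_matrix, slopes, n_threads=2)
dm_one_thread = reconstruct(control_matrix, slopes, n_threads=1)
```

Because each partial product depends only on the slopes already computed by that worker, the multiplication can begin before the whole WFS frame has been read out.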
2.1.2 DARC camera modules

Several simple example camera modules are provided
with the DARC source code. These are modules for which 
camera data are only available on a frame-by-frame basis 
(rather than a pixel-by-pixel basis), and typically would 
be used in a laboratory rather than on-sky. In this case, 
the camera data are transferred to DARC at the start 
of each frame (using the "NewFrameSync" function), and 
other functions are not implemented (except for "Open" and
"Close"). For such cameras, there is no interleaving of camera
readout and pixel processing, and so a higher latency
results. 

For camera drivers which have the ability to provide 
pixel stream access (i.e. the ability to transfer part of a frame 
to the computer before the detector readout has finished), 
a more advanced camera module can be implemented, and 
examples are provided with DARC. Such modules allow in- 
terleaving of camera readout with processing, making use of 
the "Process" function to block until enough pixels have ar- 
rived for a given WFS sub-aperture, after which calibration 
and computation of this sub-aperture can proceed. 
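
The blocking behaviour of a pixel-stream camera module can be sketched as follows. This is an assumption-laden illustration, not a DARC camera driver: `PixelStream`, `push`, `wait_for` and `process_subap` are hypothetical names, and a Python thread stands in for the detector readout.

```python
import threading


class PixelStream:
    """Hypothetical frame buffer filled progressively by a readout thread."""

    def __init__(self, n_pixels):
        self.pixels = [0] * n_pixels
        self.n_arrived = 0
        self.cond = threading.Condition()

    def push(self, values):
        """Called by the readout side as each block of pixels arrives."""
        with self.cond:
            for v in values:
                self.pixels[self.n_arrived] = v
                self.n_arrived += 1
            self.cond.notify_all()

    def wait_for(self, n_required):
        """Block until at least n_required pixels of the frame have arrived."""
        with self.cond:
            self.cond.wait_for(lambda: self.n_arrived >= n_required)


def process_subap(stream, start, stop):
    """Wait for this sub-aperture's pixels, then process them (here: sum)."""
    stream.wait_for(stop)
    return sum(stream.pixels[start:stop])


stream = PixelStream(8)


def readout():
    for row in ([1, 2, 3, 4], [5, 6, 7, 8]):
        stream.push(row)


reader = threading.Thread(target=readout)
reader.start()
sums = [process_subap(stream, 0, 4), process_subap(stream, 4, 8)]
reader.join()
```

The first sub-aperture can be processed as soon as its four pixels are available, without waiting for the second row, which is the interleaving of readout and processing described above.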



2.1.3 Parameter updating 

DARC has the ability to update any parameter on a frame- 
by-frame basis. However, since this could mean a large com- 
putational requirement (to compute the parameters from 
available data), or a large data bandwidth requirement (for 
example for updating a control matrix), this update ability 
is implemented using a DARC module interface. The user 
can then create such a module depending on their specifica- 
tions to best suit the needs and requirements of their system, 
for example using a proprietary interconnect to send the parameters.
A standard set of DARC functions is provided,
which should be overridden for this parameter buffer interface
to take effect. This buffer module is again dynamically
loadable, meaning that this ability can easily be switched 
on and off. 

This ability to update parameters every frame or in a 
deterministic fashion (at a given frame number for example) 
is optional, and additional to the non-real-time parameter 
update facility that is used to control DARC as discussed by 
Basden et al. (2010), which allows parameters to be changed
in a non-deterministic fashion (non-real-time i.e. there is no 
guarantee that a parameter will be changed at a particular 
frame number or time). 
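
The difference between deterministic and non-deterministic updates can be illustrated with a small sketch in which parameter changes are queued against a frame number and applied exactly when that frame starts. The `ParamBuffer` class and its methods are hypothetical and are not the DARC parameter buffer interface.

```python
class ParamBuffer:
    """Hypothetical buffer applying updates at a requested frame number."""

    def __init__(self, **initial):
        self.active = dict(initial)
        self.pending = {}          # frame number -> dict of updates

    def schedule(self, frame_no, **updates):
        """Queue updates to take effect at the start of frame_no."""
        self.pending.setdefault(frame_no, {}).update(updates)

    def new_frame(self, frame_no):
        """Apply any updates due this frame and return the active values."""
        updates = self.pending.pop(frame_no, None)
        if updates:
            self.active.update(updates)
        return self.active


buf = ParamBuffer(gain=0.5)
buf.schedule(2, gain=0.3)
gains = [buf.new_frame(n)["gain"] for n in range(4)]
```

Here the gain changes at frame 2 and at no other time, whereas a non-real-time update would take effect at whichever frame happened to follow the request.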



2.2 Diagnostic data handling 

The general concept of diagnostic data handling is described
in the original DARC paper (Basden et al. 2010). In summary,
there is a separate diagnostic stream for each diagnostic
data type (raw pixel data, calibrated pixel data,
wavefront slope measurements, mirror demands, etcetera). 
It should be noted that diagnostic data handling is used 
only to provide data streams to clients, not for the transfer 
of data along the real-time pipeline. As such, the diagnostic 
data system does not need to be hard-real-time. 

Transport of DARC diagnostic data uses TCP/IP sockets
by default, which lead to a reliable, simple and fairly high
performance system that is easy to understand and set up,
with minimal software configuration and installation. However,
because DARC is modular by design, users are able to
replace this system with their own should they have a need
to, using their own transport system and protocols to distribute
these data to clients. This is well suited to a facility
class telescope environment, where standardised protocols
must be followed.

Figure 2. A figure demonstrating the DARC diagnostic data
system. Here, raw pixel, slope and mirror data are being sent
from the real-time part of the system to one local node, which is
in turn sending raw pixels and slopes to other nodes. A further
local node is receiving slopes directly from the real-time part of
the system. Clients which process these data are represented by
grey circles.



2.2.1 Default diagnostic data implementation 

The default DARC diagnostic system seeks to minimise net- 
work bandwidth as much as possible using point-to-point 
(PTP) connections. Diagnostic data are sent from the real- 
time system to a remote computer, where the data are writ- 
ten into a shared memory circular buffer. Clients on this 
computer can then access the data by reading from the 
circular buffer, rather than requesting data directly from 
the real-time system across the network. Additionally, these 
data can then be re-distributed to further remote comput- 
ers, allowing the data to be read by other clients here, as 
shown in Fig. 2. Hence, each diagnostic stream needs to be
sent only once (or, depending on network topology, a small 
number of times) from the main real-time computer (rather 
than once per client), and network bandwidth can be tightly 
controlled. 

In cases where only part of a diagnostic stream is re- 
quired (for example, pixel data from a single camera in a 
multi-camera system, or a sub-region of an image), these 
data can be extracted by a DARC client into a new diagnos- 
tic stream before being distributed over the network, reduc- 
ing bandwidth requirements. Additionally, for cases where 
only the sum or average of many frames of data from a given 
stream is required this operation can be performed using a 
client provided by DARC before data are transported over 
the network, again greatly saving network bandwidth, for 
example for image calibration. 
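
The two bandwidth-saving operations described above, extracting part of a stream and averaging many frames, can be sketched as follows. The function names and the row-major one dimensional frame layout are assumptions made for this illustration, and are not the clients shipped with DARC.

```python
def extract_region(frame, width, x0, x1, y0, y1):
    """Cut a sub-region out of a row-major one dimensional pixel frame."""
    return [frame[y * width + x] for y in range(y0, y1) for x in range(x0, x1)]


def averaged(frames, n_avg):
    """Yield the mean of each consecutive group of n_avg frames."""
    acc, count = None, 0
    for f in frames:
        acc = list(f) if acc is None else [a + b for a, b in zip(acc, f)]
        count += 1
        if count == n_avg:
            yield [a / n_avg for a in acc]
            acc, count = None, 0


frames = [[0, 1, 2, 3], [2, 3, 4, 5]]        # two 2x2 frames
left_column = extract_region(frames[0], 2, 0, 1, 0, 2)
means = list(averaged(frames, 2))
```

In both cases only the reduced stream would need to cross the network, so the saving scales with the fraction of the frame discarded or the number of frames averaged.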

It should be noted that the default diagnostic system 
does not use broadcasting or multicasting. This is because 
broadcasting and (in its simplest form) multicasting are in- 
herently unreliable and thus would not provide a reliable di- 
agnostic stream, giving no guarantee that data would reach 
their destination, which is undesirable for an AO system. 
Although reliable multicasting protocols are available, a decision
has been made not to use these by default, because
this would increase the complexity of DARC, and would be 
unnecessary for users of simple systems. However we would 
like to reiterate that such diagnostic systems can easily be 
added and integrated with DARC by the end-user. 

On the real-time computer and remote nodes with di- 
agnostic clients, a region of shared memory (in /dev/shm) 
is used for each diagnostic stream, implemented using a self 
describing circular buffer typically hundreds of entries long 
(though this is configurable and in practice will depend on
available memory and stream size). Streams can be individually
turned on and off as required using the DARC control
interface (using a graphical, script or command line client 
or the API interface). DARC will write diagnostic data to 
the circular buffers of streams that are not switched off, and 
these data can then be read by clients or transported. The 
rate at which DARC writes these data can be changed (every 
frame, or every n frames with a different rate for each diag- 
nostic stream). Clients can then retrieve as much or as little 
of these data as they require, by setting the sub-sampling 
level at which they wish to receive. 

Since TCP/IP is used, retransmission of data may be
necessary when network hiccups occur (though this is handled
by the operating system). The processes responsible for
sending data over the network to clients will block until an 
acknowledgement from the client has been received (this is 
handled by the operating system), and therefore at times 
may block for a longer than average period while waiting 
for retransmission. If this happens frequently enough (for 
example on a congested network), then the head of the cir- 
cular buffer (the location at which new diagnostic data are 
written to the circular buffer) will catch up with the tail of 
the buffer (the location at which data are sent from). To 
avoid data corruption, the DARC sending process will jump 
back to the head of the circular buffer once the tail of the 
buffer falls more than 75 % of the buffer behind the head. 
Therefore, a chunk of frames will be lost. This is undesir- 
able, but should be compared with what would occur with 
an unreliable protocol, for example UDP. Here, the sending 
process would not be blocked, and so would keep up with the 
head of the circular buffer. However when a packet fails to 
reach its destination, due to congestion or a network hiccup,
the packet would simply be dropped and not retransmitted. 
Therefore, there would be a portion of a frame of data miss- 
ing (corresponding to the dropped packet), rendering (in 
most cases) the entire frame unusable. We should therefore 
consider two cases: In a highly loaded network, it is likely 
that the number of partial (unusable) frames received would 
be greater than the number of frames dropped when using 
a reliable protocol even though the total network through- 
put would be greater, because dropped packets would be 
dispersed more-or-less randomly affecting a greater number 
of frames, while dropped frames would be in chunks. In a 
less loaded network, UDP packets would still be occasion- 
ally dropped, resulting in unusable frames, while a reliable 
protocol would (on average) be able to keep up with the 
circular buffer head, dropping behind occasionally, but not 
far enough to warrant dropping a chunk of frames to jump 
back to the circular buffer head. 
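
The catch-up rule described above can be sketched with a toy circular buffer. This is one possible reading of the text rather than the DARC implementation: the `DiagnosticRing` class is hypothetical, and it assumes that a sender lagging by more than 75 % of the buffer discards every unsent frame and resumes with the next frame written.

```python
class DiagnosticRing:
    """Hypothetical circular buffer with a 75 % lag limit for the sender."""

    def __init__(self, n_entries):
        self.n = n_entries
        self.slots = [None] * n_entries
        self.head = 0        # total number of frames written
        self.tail = 0        # total number of frames sent or skipped
        self.dropped = 0

    def write(self, frame):
        """Real-time side: never blocks, overwrites the oldest slot."""
        self.slots[self.head % self.n] = frame
        self.head += 1

    def read(self):
        """Sender side: next frame to send, or None if nothing is waiting."""
        lag = self.head - self.tail
        if lag > 0.75 * self.n:
            # Too far behind: jump to the head, losing a chunk of frames.
            self.dropped += lag
            self.tail = self.head
        if self.tail == self.head:
            return None
        frame = self.slots[self.tail % self.n]
        self.tail += 1
        return frame


ring = DiagnosticRing(8)
for f in range(3):
    ring.write(f)
first = ring.read()            # sender is keeping up
for f in range(3, 9):
    ring.write(f)              # sender stalls while six more frames arrive
stalled = ring.read()          # lag of 8 exceeds 6, so the sender jumps
ring.write(9)
resumed = ring.read()
```

Frames are lost as one contiguous chunk, and every frame that is delivered is complete, which is the behaviour contrasted with UDP in the text.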



2.2.2 Disadvantages 

The disadvantage of this simple approach to diagnostic data 
is that TCP/IP is unicast and PTP, meaning that if multi- 
ple clients on different computers are interested in the same 
data then the data are sent multiple times, a large overhead.
However, we take the view that the simplicity of the
default system outweighs the disadvantages, and that for
more advanced systems, a separate telemetry system should
be implemented, for which we are unable to anticipate the
requirements.

2.2.3 Diagnostic feedback 

We have so far discussed the ability to propagate real-time 
data to interested clients. However, it is often the case that 
these data will be processed and then injected back into 
the real-time system on a per-frame basis, useful not only 
for testing, but also for calibration tools such as turbulence 
profiling, or deformable mirror shape feedback. This feed- 
back interface module can be implemented in DARC in sev- 
eral ways. One option is to use the per-frame parameter 
update module interface as discussed previously, provided 
by the user to suit their requirements. Another option is to 
modify an existing DARC processing module (dynamically 
loadable), to accept the expected input, for example modify 
a wavefront reconstruction module to accept an additional 
input (via Infiniband or any desired communication proto- 
col) of actual DM shape, to be used for pseudo-open-loop 
control reconstruction. 

2.3 User facilities 

The DARC package comes with an extensive suite of user 
tools, designed to simplify the setup and configuration 
of DARC. These include command-line based tools, a 
graphical interface, a configurable live display tool and 
an API. Since the DARC control interface is based upon
Common Object Request Broker Architecture (CORBA)
(http://www.omg.org/cgi-bin/doc?formal/00-10-33.pdf
2000), a client can be written in any programming language
which has a suitable CORBA implementation. 

Using these tools allows additional customised packages 
and facilities to be built, specific for the AO system in ques- 
tion. An example of this would be a tip-tilt offload system, 
which would capture slope diagnostic data from DARC, and 
if mean slope measurements became too large would inform 
the telescope to update its tracking. Many such systems 
based around DARC diagnostic data have been used suc- 
cessfully with CANARY. 
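
The logic of such a tip-tilt offload client might look like the sketch below. It does not use the DARC API: the slope frames are supplied as plain Python lists of (x, y) pairs, and the `notify` callback is a hypothetical stand-in for whatever mechanism informs the telescope.

```python
def tip_tilt_offload(slope_frames, threshold, notify):
    """Call notify(mean_x, mean_y) whenever a mean slope exceeds threshold.

    slope_frames: iterable of frames, each a list of (sx, sy) tuples.
    """
    for frame in slope_frames:
        n = len(frame)
        mean_x = sum(s[0] for s in frame) / n
        mean_y = sum(s[1] for s in frame) / n
        if abs(mean_x) > threshold or abs(mean_y) > threshold:
            notify(mean_x, mean_y)


frames = [
    [(0.1, 0.0), (0.3, 0.2)],      # small mean slope: no offload
    [(1.5, 0.5), (2.5, -0.5)],     # mean x slope of 2.0: offload
]
offloads = []
tip_tilt_offload(frames, threshold=1.0,
                 notify=lambda mx, my: offloads.append((mx, my)))
```

A real client would typically subscribe to a decimated slope stream, since offloading does not need every frame.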

2.4 Configurable displays 

DARC does not know the nature of the data in the diag- 
nostic streams that it produces. It is known, for example, 
that a particular stream contains DM demand data, and 
how many DM actuators there are; however, DARC knows 
nothing about the mapping of these actuators onto physical 
DMs, how many DMs there are, and what geometry they 
have. Therefore, to display these data in a way meaningful to
the user, additional information is required. 

Figure 3. A figure showing two instances of the DARC live display
tool. On the left hand side a one dimensional display of slope
measurements is shown, along with the associated toolbar,
configured for this AO system (CANARY) allowing the user to select
which slopes from which wavefront sensor to display. On the right
hand side, a two dimensional wavefront sensor image is displayed,
with sub-aperture boundaries and current spot locations overlain
(as a grid pattern, and cross hairs respectively), and the display
is receiving multiple diagnostic streams (image and slopes)
simultaneously. In this display, the toolbar is hidden (revealed with a
mouse click).

The DARC live display tool allows user configuration,
both manually entered and via configuration files. Additionally,
a collection of configuration files can be used and the
live display will present a selection dialogue box for the user 
to rapidly switch between configurations, for example to dis- 
play a phase map of different DMs, or to switch between a 
phase map and a WFS image display. Each configuration can 
specify which diagnostic streams should be received and at 
what rate, allowing a given configuration to receive multiple 
diagnostic streams simultaneously, for example, allowing a 
spot centroid position to be overlain on a calibrated pixel 
display. Manually entered configuration is useful while an 
AO system is being designed and built, and can be used for 
example to change a one dimensional pixel stream into a two 
dimensional image for display. 

The live display can be configured with user selectable 
buttons, which can then be used to control this display con- 
figuration. For example, when configured for a WFS pixel 
display, the user might be able to turn on and off a vector 
field of the slope measurements by toggling a button. 

This configurability is aimed at ease of use, as a simple 
way of getting a user friendly AO system up and running. 
Although not designed to be used as a facility class display 
tool, it does have sufficient flexibility and capability to function
as such. Fig. 3 shows two instances of the display tool
being used to display WFS slope measurements and cali- 
brated pixel data. 



2.5 DARC algorithms 

Since DARC is primarily based around CPU computation 
and is modular, it is easy to implement and test new al- 
gorithms. As a result, there are a large number of algo- 
rithms implemented, and this list is growing as new ideas or 
requirements arise. Some of these algorithms are given by
Basden et al. (2010), and we present additional algorithms
here.



2.5.1 Pixel processing and slope computation

The use of a brightest pixel selection algorithm
(Basden et al. 2011) has been demonstrated successfully
on-sky using DARC. This algorithm involves selecting
a user determined number of brightest pixels in each 
sub-aperture and setting the image threshold for this 
sub-aperture at this level. This helps to reduce the impact 
of detector readout noise, and can lead to a significant 
reduction in wavefront error. 
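
The thresholding step described above can be sketched in a few lines of Python. This is an illustration only, not the DARC implementation: the function name is hypothetical, and the sketch assumes that the threshold is the value of the n-th brightest pixel and that fainter pixels are simply set to zero (the threshold could equally be subtracted).

```python
def brightest_pixel_threshold(subap, n_brightest):
    """Zero every pixel fainter than the n-th brightest pixel of a sub-aperture.

    subap is a list of rows of pixel values. Zeroing (rather than
    subtracting the threshold) is an assumption of this sketch.
    """
    flat = sorted((p for row in subap for p in row), reverse=True)
    n = max(1, min(n_brightest, len(flat)))
    threshold = flat[n - 1]  # value of the n-th brightest pixel
    return [[p if p >= threshold else 0 for p in row] for row in subap]


# A 4 x 4 sub-aperture with a spot in the centre and low-level noise.
subap = [
    [1, 2, 1, 0],
    [2, 9, 7, 1],
    [1, 8, 6, 2],
    [0, 1, 2, 1],
]
cleaned = brightest_pixel_threshold(subap, n_brightest=4)
```

With four brightest pixels selected, only the central spot survives, so noise-dominated pixels no longer contribute to the slope estimate.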

The ability of DARC to perform adaptive windowing,
or spot tracking with Shack-Hartmann wavefront sensors,
has been mentioned previously (Basden et al. 2010). Further
functionality has now been added allowing groups of
sub-apertures to be specified, for which the adaptive window 
positions are computed based on the mean slope measure- 
ments for this group, and hence these grouped sub-apertures 
all move together. This allows, for example, tip-tilt tracking 
on a per-camera basis when multiple wavefront sensors are 
combined onto the same detector. 

One danger with adaptive windowing is that in the 
event of a spurious signal (for example a cosmic ray event) or 
if the signal gets lost (for example intermittent cloud), then
the adaptive windows can move away from the true spot 
location. Adaptive window locations will then be updated 
based upon noise, and so the windows will move randomly 
until they find their Shack-Hartmann spot, or fix on another 
nearby spot. To prevent this from happening, it is possible 
to specify the maximum amount by which each window is 
allowed to move from the nominal spot location. Addition- 
ally, adaptive window locations are computed from the local 
spot position using an infinite impulse response (IIR) filter, 
for which the gain can be specified by the user, helping to 
reduce the impact of spurious signals. 
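
The two safeguards described above (a filtered window update and a limit on window travel) can be illustrated as follows. This is a minimal sketch under stated assumptions: the filter is taken to be first order, positions are one-dimensional offsets in pixels from the nominal spot location, and the function and parameter names are hypothetical rather than DARC parameter names.

```python
def update_window(offset, spot_pos, gain, max_shift):
    """Return the new window offset (pixels from the nominal position).

    offset    -- current window offset from the nominal spot location
    spot_pos  -- spot position measured this frame, relative to nominal
    gain      -- filter gain in (0, 1]; small values react slowly
    max_shift -- largest allowed absolute offset
    """
    # First-order IIR filter (assumed form): blend old offset and new measurement.
    filtered = (1.0 - gain) * offset + gain * spot_pos
    # Clamp so that the window can never wander far from the nominal location.
    return max(-max_shift, min(max_shift, filtered))


# The third measurement mimics a spurious event such as a cosmic ray hit.
measurements = [1.0, 1.0, 50.0, 1.0]
offset = 0.0
history = []
for spot in measurements:
    offset = update_window(offset, spot, gain=0.5, max_shift=3.0)
    history.append(offset)
```

In this example the spurious measurement drives the window only as far as the 3 pixel limit, and the window starts to return towards the true spot on the next frame.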

In addition to weighted centre of gravity and correlation 
based slope computation, a matched filter algorithm can also 
be used. DARC can be used not only with Shack-Hartmann 
wavefront sensors, but also with Pyramid wavefront sensors 
and, in theory, with curvature wavefront sensors (though 
this has never been tested), due to the flexible method by 
which pixels are assigned to sub-apertures (which can be 
done in an arbitrary fashion). The modular nature of DARC
means that other sensor types could easily be added, for example
YAW (Gendron et al. 2010) and optically binned
Shack-Hartmann sensors (Basden et al. 2007).



2.5.2 Reconstruction and mirror control 

In addition to matrix-vector based wavefront reconstruction
(allowing for least-squares and minimum mean square error
algorithms), linear quadratic Gaussian (LQG) reconstruction
can also be carried out, allowing for vibration suppression,
which can lead to a significant performance improvement
(Correia et al. 2010). An iterative solver based on preconditioned
conjugate gradient is also available, and can be used
with both sparse and dense systems. Using this reconstruction
technique (Gilles et al. 2003) has the advantage that a
matrix inversion to compute a control matrix is not required.
In addition, an open-loop control formulation is available
which allows DM commands ($a$) to be computed according
to

$a_i = (1 - g)\,E \cdot a_{i-1} + g\,R \cdot s_i$    (1)

where $s_i$ are the current wavefront slope measurements, $g$ is
a gain parameter, $R$ is the control matrix, and $E$ is a square
matrix, which for an integrator control law would be equal
to an identity matrix scaled by $1/(1-g)$.
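
Equation (1) can be checked numerically with a short pure-Python sketch. The helper names below are hypothetical and the matrices are toy values chosen for illustration; the point is only to show that choosing $E$ as the identity scaled by $1/(1-g)$ reduces the update to a simple integrator, $a_i = a_{i-1} + g\,R \cdot s_i$.

```python
def matvec(m, v):
    """Dense matrix-vector product for lists of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def open_loop_update(a_prev, slopes, R, E, g):
    """One step of equation (1): a_i = (1 - g) E a_{i-1} + g R s_i."""
    Ea = matvec(E, a_prev)
    Rs = matvec(R, slopes)
    return [(1.0 - g) * ea + g * rs for ea, rs in zip(Ea, Rs)]


def integrator_E(n, g):
    """Identity matrix scaled by 1 / (1 - g), as for an integrator law."""
    scale = 1.0 / (1.0 - g)
    return [[scale if i == j else 0.0 for j in range(n)] for i in range(n)]


g = 0.5
R = [[1.0, 0.0, 1.0],   # 2 actuators x 3 slopes (toy control matrix)
     [0.0, 2.0, 0.0]]
slopes = [0.5, 0.25, 0.5]
a_prev = [1.0, 2.0]

a_new = open_loop_update(a_prev, slopes, R, integrator_E(2, g), g)
# Plain integrator for comparison: a_prev + g * R * s
a_integrator = [a + g * rs for a, rs in zip(a_prev, matvec(R, slopes))]
```

Both `a_new` and `a_integrator` hold the same DM commands, which confirms the scaling of $E$ quoted above.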

DARC also contains the ability to perform automatic 
loop opening in the case of actuator saturation, i.e. to au- 
tomatically open the control loop and flatten the DM if a 
predefined number of actuators reach a predefined satura- 
tion value. This can be important to avoid damage while 
testing new algorithms and control laws. 

Some DMs display hysteresis and other non-linear be- 
haviour that results in the shaped DM not forming quite the 
shape that was requested. To help reduce this effect, DARC 
includes the option to perform actuator oscillation around 
the desired DM shape position, allowing the effect of hys- 
teresis to be greatly reduced. Typically, a decaying sine wave 
is used, with the decay leading to the desired position. 
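
A decaying oscillation of this kind might be generated as in the sketch below. The text above states only that a decaying sine wave is typically used; the exponential envelope, the linear taper that forces the final demand to equal the target, and all parameter names and values are assumptions made for illustration.

```python
import math


def settle_sequence(target, amplitude, n_steps, cycles, decay):
    """Actuator demands oscillating about `target` with a decaying sine.

    The envelope (exponential decay times a linear taper) is an assumed
    form; the taper makes the last demand exactly equal to `target`.
    Requires n_steps >= 2.
    """
    seq = []
    for k in range(n_steps):
        t = k / (n_steps - 1)  # runs from 0 to 1
        envelope = amplitude * math.exp(-decay * t) * (1.0 - t)
        seq.append(target + envelope * math.sin(2.0 * math.pi * cycles * t))
    return seq


# Hypothetical numbers: approach a demand of 0.3 with an initial swing of 0.1.
demands = settle_sequence(target=0.3, amplitude=0.1, n_steps=41, cycles=5, decay=3.0)
```

Each entry of `demands` would be sent to the actuator in turn, with the final entry being the desired position itself.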

Since DARC has the ability to update parameters on 
a frame-by-frame basis, it has the ability to modulate (or 
apply any time-varying signal to) some or all of the DM 
actuator demands, allowing complex system operations to 
be performed. 

2.5.3 Advanced computation 

DARC has the ability to operate multiple WFSs asyn- 
chronously, i.e. at independent frame rates (which are not 
required to be multiples of each other). This gives the abil- 
ity to optimise wavefront sensor frame rate depending on 
guide star brightness, and so can lead to an improvement of 
AO system performance. This capability is achieved by us- 
ing multiple instances of DARC, one for each WFS, which 
compute partial DM commands based on the WFS data, 
and a further instance of DARC which combines the partial 
DM commands once they are ready, using shared memory 
and mutual exclusion locks for inter-process communication. 
The way in which the partial DM commands are combined 
is flexible, with the most common option being to combine 
these partial commands together as they become available 
and then update the DM, meaning that the DM is always 
updated with minimal latency. 

The ability to operate multiple instances of DARC, 
combined with modularity, means that DARC can be used 
in a distributed fashion, allowing computational load to be 
spread over a computing grid. To achieve this, modules re- 
sponsible for distributing or collating data at different points 
in the computation pipeline are used, with data being com- 
municated over the most efficient transport mechanism. At 
present, such modules exist based on standard Internet sock- 
ets, and also using shared memory. This flexibility allows 
DARC to be optimised for available hardware. Extremely 
demanding system requirements can therefore be met. 

2.6 Hardware acceleration 

The modular nature of DARC makes it ideal for use with acceleration
hardware, such as field programmable gate arrays
(FPGAs) and GPUs. Two hardware acceleration modules
currently exist for DARC (and of course more can easily 
be added). These are an FPGA based pixel processing unit 
that can perform image calibration and optionally wavefront
slope calculation and is described elsewhere (Fedrigo et al.
2006), and a GPU based wavefront reconstruction module,
which we now describe. 

2.6.1 GPU wavefront reconstruction 

The GPU wavefront reconstruction module can be used with 
any CUDA compatible GPU. It performs a matrix-vector
multiplication based wavefront reconstruction, which (depending
on the matrix) includes least squares and minimum
variance reconstruction. As discussed previously, DARC uses
a horizontal processing strategy, processing sub-apertures as
the corresponding pixel data become available. The GPU re- 
construction module also follows this strategy, with partial 
reconstructions (using appropriate subsets of the matrix) 
being performed before combination to provide the recon- 
structed wavefront. 
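
The idea of partial reconstruction can be shown with a CPU-only sketch: each block of slopes is multiplied by the matching columns of the control matrix as soon as it is available, and the partial results are accumulated. This is plain Python rather than the CUDA kernel used by DARC, and the function names and toy values are hypothetical.

```python
def accumulate_block(acc, R, slope_block, first_col):
    """Add R[:, first_col:first_col+len(slope_block)] * slope_block to acc."""
    for i, row in enumerate(R):
        acc[i] += sum(row[first_col + j] * s for j, s in enumerate(slope_block))
    return acc


def full_matvec(R, slopes):
    """Reference result: the whole product computed in one go."""
    return [sum(r * s for r, s in zip(row, slopes)) for row in R]


R = [[1.0, 2.0, 3.0, 4.0],    # 2 actuators x 4 slopes (toy control matrix)
     [0.5, 0.0, -1.0, 2.0]]
slopes = [1.0, 0.5, -2.0, 0.25]

# Process the slopes in two blocks, as if the second block arrived later.
acc = [0.0, 0.0]
for first_col in (0, 2):
    block = slopes[first_col:first_col + 2]
    accumulate_block(acc, R, block, first_col)

reference = full_matvec(R, slopes)
```

After the last block has been processed, `acc` equals `reference`, so very little work remains once the final pixels have arrived.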

The control matrix is stored in GPU memory, and wave- 
front slope measurements are uploaded to the GPU every 
frame. The GPU kernel that performs the operations has 
been written specifically for DARC, providing a 70 % performance
improvement over the standard cuBLAS library, which
is available from the GPU manufacturer NVIDIA. This module
uses single precision floating point data.

The performance reached by this module is limited by 
GPU internal memory bandwidth rather than computation 
power, since the matrix has to be read from GPU mem- 
ory into GPU processors every frame. We are able to reach 
about 70 % of peak theoretical performance on a NVIDIA 
Tesla 2070 GPU card. An alternative version of this DARC 
module also exists which stores the control matrix in a 16 
bit integer format, with conversion to single precision float- 
ing point performed each frame before multiplication. This 
allows a performance improvement of about 80 % (reduc- 
ing computation time by 45 %) with a trade-off of reduced 
precision. However, as shown by Basden et al. (2010), 16 bit
precision is certainly sufficient for some ELT scale AO sys- 
tems (it may not be sufficient for a high contrast system, 
though we have not investigated this). It should be noted 
that slope measurements for astronomical AO systems are 
typically accurate to at most 10-11 bits of precision, limited 
by photon noise. 
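
The trade-off between storage size and precision can be illustrated with a small sketch of 16 bit matrix storage. The scheme below (a single scale factor for the whole matrix, with conversion back to floating point at multiplication time) is an assumption made for illustration; the text above does not specify how the DARC module scales its integer matrix.

```python
def quantise_int16(matrix):
    """Store a matrix as signed 16 bit integers plus one global scale factor."""
    peak = max(abs(x) for row in matrix for x in row)
    scale = peak / 32767.0 if peak > 0 else 1.0
    q = [[int(round(x / scale)) for x in row] for row in matrix]
    return q, scale


def matvec_int16(q, scale, v):
    """Convert each stored element back to floating point, then multiply."""
    return [sum((qij * scale) * vj for qij, vj in zip(row, v)) for row in q]


R = [[0.8, -0.3],
     [0.1, 0.5]]
slopes = [1.0, 2.0]

q, scale = quantise_int16(R)
approx = matvec_int16(q, scale, slopes)
exact = [sum(r * s for r, s in zip(row, slopes)) for row in R]
worst_error = max(abs(a - e) for a, e in zip(approx, exact))
```

Halving the number of bytes read per matrix element is what gives the speed-up for a bandwidth-limited kernel, at the cost of the small rounding error measured by `worst_error`.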



3 DARC OPTIMISATION 

DARC has been designed to provide low latency control 
for AO, operating with a baseline of minimal hardware (a 
computer), whilst also providing the ability to use high 
end hardware, and hardware acceleration. This has been 
achieved by careful management of the workload given to 
processor threads using a horizontal processing strategy, and 
by reducing the need for synchronisation, as described by
Basden et al. (2010). DARC has since been updated to further
reduce this latency, with steps being taken to reduce the
number of thread synchronisation points (thus reducing syn- 
chronisation delays), and also providing control over where 
post-processing is performed. The increased modularisation 
of DARC has led to a clearer code structure and allowed the 
thread synchronisation points to be rationalised, providing 
opportunities for better optimisation of modules, which can 
be fitted into the DARC structure more easily. 
3.1 Site optimisation

To optimise DARC for a specific application, there are sev- 
eral steps that can be taken to allow best performance (low- 
est latency) to be achieved for a given hardware setup. For 
all of these steps, no recompilation is necessary, and many 
can be performed without stopping DARC. In this section 
we will present these optimisations and discuss why they can 
make a difference, and how they should be applied. The large 
number of optimisation points built into DARC means that
it is well suited to meet the demands of future instruments. 







3.1.1 Number of threads 

The main way to optimise DARC performance is to ad- 
just the number of processing threads used. A higher order 
AO system will have reduced latency when more processing 
threads are used; however, the number of threads should 
be less than the number of processing cores available, which 
will in turn depend on the processing hardware. The balance 
between processing power and memory bandwidth can also 
affect the optimal number of processing threads. Fig. 4 shows
how maximum achievable frame rate is affected by the num- 
ber of processing threads for a 40 x 40 sub-aperture single 
conjugate adaptive optics (SCAO) system, and it should be 
noted that a higher frame rate corresponds to lower latency. 
The frame rates displayed here were measured using a com- 
puter with two 6-core Intel E5645 processors with a clock 
speed of 2.4 GHz and hyper-threading enabled, giving a to- 
tal of 24 processing cores (12 physical). We have restricted 
threads to run on a single core (i.e. no thread migration, by 
setting the thread affinity), with the first six threads running
on the first CPU, the next six on the second CPU, the
next six on the first and so on as required (so each processor 
core may have multiple threads, but each thread is only al- 
lowed on a single core). It can be seen from this figure that 
DARC gives near linear performance scaling with number of 
threads up to the number equal to the number of physical 
CPU cores (12), at which point, maximum performance is 
achieved. This is followed by a dip at 13 and 14 threads as 
one core is then having to run two DARC threads. Perfor- 
mance then increases again up to 24 threads (except for a 
dip at 22 threads, which we are unsure about, but which is 
repeatable), after which performance levels off (and eventu- 
ally falls) as each core has to run an increasing number of 
threads, and thread synchronisation then begins to take its 
toll. 
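
The thread placement used for these measurements can be written down as a simple mapping from thread index to processor and core. The sketch below only computes the mapping; the function name is hypothetical, and how a (processor, core) pair translates to an operating system core number depends on the machine. On Linux, Python exposes `os.sched_setaffinity` for this kind of pinning, but it is not called here so that the example runs anywhere.

```python
from collections import Counter


def thread_placement(n_threads, n_sockets=2, cores_per_socket=6):
    """Return a (socket, core) pair for each thread.

    Threads fill the first processor, then the second, then wrap back
    to the first, as described for the measurements above.
    """
    placement = []
    for t in range(n_threads):
        socket = (t // cores_per_socket) % n_sockets
        core = t % cores_per_socket
        placement.append((socket, core))
    return placement


placement = thread_placement(14)
threads_per_core = Counter(placement)
busiest = max(threads_per_core.values())
```

With 14 threads, two physical cores on the first processor each carry two threads while the rest carry one, which is consistent with the dip in performance seen just above 12 threads.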

This figure shows that hyper-threading is detrimental 
to performance in this case, and we recommend that the 
use of hyper-threading should be investigated by any DARC 
implementer and switched off (in the computer BIOS) if it is 
detrimental. However, this is not a condition that we would 
wish to impose because all situations are different and some 
users may find it desirable to have hyper-threading. 

It should be noted that there is also the option of using 
a separate thread for initialisation and post-processing of 
data, or to have this work done by one of the main processing 
threads, and each option can provide better performance for 
different situations. 



Figure 4. A plot showing how AO system maximum achievable 
frame rate depends on the number of DARC processing threads 
used. Uncertainties are shown but generally too small to discern. 

3.1.2 Affinity and priority of threads 

By giving processing threads elevated priorities the Linux 
kernel will put more effort into running these threads. How- 
ever, when using DARC, best performance is not necessarily 
achieved by giving all threads high (or maximum) priority. 
Rather, there can be an advantage in considering the work 
that each individual thread is required to do, and whether 
particular threads are likely to be on the critical path at any 
given point. 

As a simple example, consider the case of an AO sys- 
tem comprising a high order WFS and a tip-tilt sensor. In 
DARC, one thread would be assigned to the tip-tilt sensor, 
which has very little work to do, while a number of threads 
would be assigned to the high order WFS, each of which will 
have more computational demand. There will also be a post- 
processing thread which is used for operations that are not 
suitable for multi-threaded use, for example sending mirror 
demands to a deformable mirror. In this case, lowest latency 
will be achieved by giving highest priority to the high order 
WFS processing threads. The tip-tilt thread should be given 
a lower priority, and its work will be completed during com- 
putation gaps. The post-processing thread should be given 
a higher priority, so that it will complete as soon as all data 
are available, reducing latency. 

The location of threads can also be specified, restricting 
a given thread to run on one, or a subset of CPU cores. This 
prevents the kernel from migrating threads to different cores, 
and also allows threads to be placed closest to hardware in 
non-uniform systems (for example where one CPU has direct 
access to an interface bus), so improving performance. A fine 
tuning of latency and jitter can be achieved in this way. 



3.1.3 Sub-aperture numbers and ordering 

Another optimisation that can be made, but which is far less 
obvious, is to ensure that there are an even number of sub- 
apertures defined for each wavefront sensor, and that pairs of 
sub-apertures are processed by the same processing thread 
(ensured using the sub-aperture allocation facility). This en- 
sures that wavefront slopes are aligned on a 16 byte memory 
boundary for the partial matrix-vector multiplication during 
the wavefront reconstruction processing, and allows streaming
SIMD extension (SSE) operations (vector operations) to
be carried out. It should be noted that if the system con- 
tains an odd number of sub-apertures, an additional one can 
be added that has no impact on the final DM calculations 
simply by adding a column of zeros to a control matrix. 
This optimisation can have a large impact on performance. 
Where possible, all matrix and vector operations in DARC 
are carried out using data aligned to a 16 byte boundary to 
make maximum use of SSE operations. 



System             Wavefront sensor   Deformable mirror   Frame rate   Telescope diameter

VLT planet-finder  40 × 40            41 × 41             2 kHz        8 m
Palm-3000          62 × 62            3300                2 kHz        5 m
ELT-EAGLE          11 of 84 × 84      20 of 85 × 85       250 Hz       ~40 m


Table 1. Some existing and proposed AO systems with demanding
computational requirements for the real-time system. The high
degree of correction makes these systems cutting-edge in the
science they can deliver.



3.1.4 Pixel readout and handling 

A low latency AO system will usually use a WFS camera 
for which pixels can be made available for processing as 
they are read out of the camera, rather than on a frame- 
by-frame basis, or at least made available in chunks smaller 
than a frame. This allows processing to begin before the 
full frame is available, and since camera readout is generally 
slow, this can give a significant latency reduction, and can 
mean that minimal operations are required once the last 
pixel arrives. With DARC, optimisations can be made by 
optimising the "chunk" size, i.e. the number of pixels that 
are made available for processing together. A smaller chunk 
size will require a larger number of interrupts to be raised, 
and also a larger number of data requests (typically direct 
memory access (DMA)). Conversely, a larger chunk size will 
mean that there is a greater delay between pixels leaving 
the camera and becoming available for processing. There is 
therefore a trade-off to be made, that will depend on cam- 
era type, data acquisition type and processor performance 
among other things. 

Related to this is an optimisation that allows DARC to 
process sub-apertures in groups. There is a parameter that 
is used by DARC to specify the number of pixels that must 
have arrived before a given sub-aperture can be processed. 
If this value is rounded up to the nearest multiple of chunk 
size then there will be multiple sub-apertures waiting for the 
same number of pixels to have arrived, and hence these can 
be processed together, reducing the number of function calls 
required, and hence the latency. The processing of particular 
sub-apertures can be assigned to particular threads to help 
facilitate this optimisation. 
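
The grouping effect of rounding up to the chunk size can be seen in a short example. The function names, the pixel counts, and the chunk size of 128 pixels are hypothetical values chosen for illustration.

```python
def round_up_to_chunk(pixels_needed, chunk_size):
    """Smallest multiple of chunk_size that is >= pixels_needed."""
    return -(-pixels_needed // chunk_size) * chunk_size


def group_by_arrival(pixels_needed_per_subap, chunk_size):
    """Group sub-aperture indices by the rounded pixel count they wait for."""
    groups = {}
    for idx, needed in enumerate(pixels_needed_per_subap):
        key = round_up_to_chunk(needed, chunk_size)
        groups.setdefault(key, []).append(idx)
    return groups


# Pixels that must have arrived before each of five sub-apertures is complete.
needed = [130, 138, 146, 260, 268]
groups = group_by_arrival(needed, chunk_size=128)
```

Here five sub-apertures collapse into two groups, so the processing function is entered twice rather than five times.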

3.1.5 Linux kernel impact 

The Linux kernel version with which DARC is run can also 
have an impact on latency. We do not have a definitive an- 
swer to which is the best kernel to use, because this de- 
pends somewhat on the system hardware, and also AO sys- 
tem order. We also find that when using a real-time kernel 
(with the RT-preempt patch), latency (as well as jitter) is 
slightly reduced. Therefore, if DARC is struggling to reach 
the desired latency for a given system, investigating different 
kernels may prove fruitful. The choice of kernel can also
affect diagnostic bandwidth.

3.1.6 Grid utilisation 

More ambitious latency reduction can be achieved by 
spreading the DARC computational load across a grid computing
cluster. For some AO systems, the division of work
will fall naturally. For example, systems with more than one 
WFS could place each WFS on a separate grid node, be- 
fore combining the results in a further node which sends 
commands to a DM. For other AO systems, the division 
of labour may not be so obvious, for example an extreme
adaptive optics (XAO) system with only a single wavefront 
sensor. In cases such as this, separation would be on a per 
sub-aperture basis with responsibility for processing differ- 
ent sub-apertures being placed on different grid nodes. How- 
ever, to achieve optimal performance in this case it must be 
possible to split WFS camera pixels between nodes using 
for example a cable splitter, rather than reading the WFS 
into one node and then distributing pixels from there, which 
would introduce additional latency. 

The effectiveness of using DARC in a grid computing 
environment depends to some extent on the communication 
link between grid nodes. Real-time data must be passed be- 
tween these nodes, and so we recommend that dedicated 
links be used, that will not be used for other communica- 
tions, such as diagnostic data. These links should also be de- 
terministic to reduce system jitter. Additionally, the higher 
the performance of these links, the lower the overall latency 
will be. A DARC implementer will need to implement their 
own DARC modules according to the communication pro- 
tocol and hardware that they use, as the standard DARC 
package contains only a TCP/IP implementation that is not 
ideal for low jitter requirements. It will also be necessary for 
the implementer to ensure that a lower latency is achieved 
when using a grid of computers than can be achieved on a 
single computer. 



4 DARC IMPLEMENTATIONS 

Optimising DARC for a particular AO system requires some 
thought. Using hardware available at Durham, consisting 
of a dual six-core processor server (Intel E5645 processors 
with a clock speed of 2.4 GHz) with three NVIDIA Tesla 
2070 GPU acceleration cards, we have implemented the real- 
time control component of several cutting edge existing and 
proposed AO systems, as given in Table 1. In doing this, we
have sought to minimise the latency that can be achieved 
using DARC for these systems. In the following sections we 
discuss these implementations, the achievable performance 
and the implications that this has. 
4.1 DARC as a RTC for a system resembling a
VLT planet-finder



A planet-finder class instrument is currently under development
for one of the VLT telescopes in Chile. The AO system
for this instrument, SPHERE (Fusco et al. 2006), is based
on a 40 × 40 sub-aperture Shack-Hartmann extreme AO system.
Real-time control will be provided by an ESO standard
SPARTA system, comprised of a Xilinx Virtex II-Pro FPGA 
for pixel processing (WFS calibration and slope calculation) , 
and a DSP based system for wavefront reconstruction and 
mirror control, comprised of four modules of eight Bittware 
Tigershark TS201 DSPs. 

This AO system is required to operate at a frame rate 
of at least 1.2 kHz, and a goal of 2 kHz, with a total latency 
(including detector read out and mirror drive) of less than
1 ms (Fusco et al. 2006).

We have implemented an equivalent AO RTCS using
DARC, based on the aforementioned server PC, but without
using the GPU acceleration cards. It should be noted
that we do not have an appropriate camera or DM to model
this system, and so the implementation here does not include
these system aspects. It does, however, include all other
aspects of a real-time control system, including interleaved
processing and readout and thread synchronisation. We are
able to operate this system at a maximum frame rate of
over 3 kHz, using 12 CPU threads, corresponding to a total
mean frame processing time of about 323 µs, as shown in
Fig. 4. In a real system (with a real camera), WFS readout
would be interleaved with processing, and so the latency,
defined in this paper as the time taken from last pixel received
to last DM actuator set, would be significantly less
than this, because most processing would occur while waiting
for pixels to be read out of a camera and sent to the
RTCS. Therefore, the latency measured from last pixel acquired
to last command out of the box is expected to be well
below 100 µs, well within the required specification. Here,
we find that processing sub-apertures in blocks of about 40
at a time gives best performance. When interpreting these
results, it should be noted that in a standard configuration
as used here, DARC does not allow pipelining of frames:
computation of one frame must complete before the next
frame begins, and so the maximum frame rate achievable is
the inverse of the frame computation time.

DARC performance for this configuration was assessed
by using the Linux real-time clock to measure frame computation
time. A real-time Linux kernel (2.6.31-10.rt, available
from Ubuntu archives) was used. The frame computation
time was measured to be 323 ± 11 µs, averaged over ten million
consecutive frames, a histogram of which is shown in
Fig. 5. The system jitter, taken as the standard deviation of
frame computation time, was therefore 11 µs root-mean-square
(RMS). The maximum frame time measured over this period
was 508 µs, though this (as can be seen from the standard
deviation) was an extremely rare event. In fact, only 501
frames (out of $10^7$) took longer than 400 µs to compute,
only 106 frames took longer than 425 µs, and only 2 frames
took longer than 500 µs. The mean frame computation time
here corresponds to a frame rate of greater than 3 kHz.
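
Because frames are not pipelined in this configuration, the figures quoted above can be converted directly into a maximum frame rate and an outlier fraction. The short calculation below uses only numbers from the text; the function name is hypothetical.

```python
def max_frame_rate_hz(frame_time_us):
    """Maximum frame rate when one frame must finish before the next starts."""
    return 1.0e6 / frame_time_us


mean_rate = max_frame_rate_hz(323.0)    # from the mean frame time
worst_rate = max_frame_rate_hz(508.0)   # from the slowest frame observed

n_frames = 10**7
fraction_over_400us = 501 / n_frames
fraction_over_500us = 2 / n_frames
```

The mean frame time gives a rate just under 3.1 kHz, and even the single slowest frame measured corresponds to an instantaneous rate of a little under 2 kHz.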

Equivalent measurements made with a stock Linux kernel
(2.6.32) did not give significantly worse performance,
with the mean frame computation time increasing slightly
to 343 ± 13 µs and no increase in the processing time tail.

Figure 5. A histogram of frame computation times (in µs) for a
40 × 40 sub-aperture AO system measured with $10^7$ samples. Inset
is shown a logarithmic histogram (showing outliers more clearly),
and also a linear histogram of outliers only. This clearly shows
that jitter is well constrained.

The introduction of a camera and a DM to this system 
would increase the frame computation time due to the ne- 
cessity to transfer pixel data and DM demand data, however 
our experience shows that we would not expect a significant 
increase in computation time, and maximum frame rates 
greater than 2 kHz would still be easily achievable. 

It is interesting to note that more recent stock Ubuntu 
kernels (2.6.35 and 2.6.38) give far worse performance, al- 
most doubling the frame computation time. At this point 
we have not investigated further, though intend to do so. 
This could be due to parameters used during kernel compi- 
lation, or due to actual changes in the kernel source, though 
other available benchmarks do not suggest that this is a
likely problem. However, it is worth bearing in mind that performance
of a software based AO control system may be
dependent on the operating system kernel used. 

Typically, a design for an AO real-time control system is 
made before actual hardware and software is available, and 
so predictability of performance of a CPU-based real-time 
controller is not usually well defined. 

However, by using preexisting real-time control software 
such as DARC, performance predictability can be improved, 
as it allows hardware to be purchased earlier in the develop- 
ment cycle of a real-time control system, for immediate use. 
This removes much of the uncertainty of CPU-based con- 
trollers much earlier in the design and prototyping phases 
of AO system development. 

4.2 DARC as a RTC for a system resembling the 
Palm-3000 AO system 

The Palm-3000 AO system on the five meter Hale telescope
is the highest order AO system yet commissioned
(Truong et al. 2008), and has the highest computational demands.
At highest order correction, it uses a 62 × 62 Shack-Hartmann
wavefront sensor and a DM with about 3300 active
elements, and is specified to have a maximum frame
rate of 2 kHz. For real-time control, this system uses 8 PCs
and 16 GPUs, far more hardware than we have available at
our disposal. However, by using DARC on our existing (pre- 
viously mentioned) hardware, we have been able to meet 
this specification, though again, as we do not have suitable 
cameras and DMs, our measurements do not include these. 

In order to implement this system, we use the three
Tesla GPUs for wavefront reconstruction, spreading the
matrix-vector multiplication equally between them. The matrix
in each GPU has dimensions equal to $N_{\rm act} \times N_{\rm slopes}/3$,
with $N_{\rm act}$ being the number of DM actuators, and $N_{\rm slopes}$
being the number of slopes (twice the number of active sub-apertures).
In order to interleave camera read-out with pixel
processing, we split these multiplications into four blocks, 
i.e. perform a partial matrix-vector multiplication once a 
quarter, a half, three quarters and all the slope measure- 
ments for each GPU have been computed, thus performing 
a total of twelve matrix-vector multiplications per frame. It 
is interesting to note that the actual Palm-3000 RTCS splits
processing into two blocks per GPU, i.e. processing occurs 
half way through pixel arrival and after all pixels have ar- 
rived. Our implementation performs pixel calibration and 
slope calculation in CPU, dividing the work equally between 
twelve processor cores using twelve processing threads. 

By configuring DARC in this way, we are able to achieve
a frame rate of 2 kHz and a corresponding frame processing
time of 500 µs. The addition of a real camera and DM
would increase this processing time, meaning that the official
specifications would not be met. However, by adding a fourth
GPU, performance could be increased.

It should be noted that our GPUs are of a higher specifi- 
cation, with memory bandwidth (the bottleneck) being 65 % 
higher than those used by Palm-3000, which will account for 
some of the difference. We also perform calibration and slope 
computation in CPU, while the Palm-3000 system performs 
this in GPU, and the Palm-3000 system will include some 
overhead for contingency (i.e. the maximum frame-rate is 
likely to be greater than 2 kHz should the wavefront sensor 
allow it). Because all our calculations are performed within 
one computing node, we do not suffer from increased latency 
due to computer-computer communication, which the Palm- 
3000 system will include. These differences help to explain 
how DARC is able to perform the same task using signifi- 
cantly less hardware and a simpler design. 

This demonstrates that DARC is suitable for use with 
high order AO correction, despite being primarily CPU 
based, i.e. CPU based real-time control systems are suffi- 
ciently powerful for current AO systems. 



4.3 DARC as a RTC for a system resembling 
E-ELT EAGLE 

A multi-object spectrograph for the European ELT (E-ELT)
is currently in the design phase. This instrument, EAGLE
(Cuby et al. 2008), will have a multi-object AO system
(Rousset et al. 2010), with independent wavefront correction
along multiple lines of sight, in directions not necessarily
aligned with wavefront sensors, as shown in Fig. 6. A
current design for EAGLE consists of 11 WFSs (of which six
use laser guide stars), and up to 20 correction arms. Each
WFS has 84 × 84 sub-apertures, and each correction arm
contains an 85 × 85 actuator DM, updated at a desired rate
of 250 Hz.







Figure 6. A figure demonstrating multi-object adaptive optics. 
The large circle represents the telescope field of view, and six 
laser guide stars and associated wavefront sensors are arranged 
in a ring near the edge of this. Five natural guide star wavefront 
sensors are then placed on appropriate stars, and multiple science 
targets, each with a deformable mirror, can then be picked out.



The real-time control requirement for this system is 
demanding, though each correction arm is decoupled, and 
hence can be treated as a separate AO system. Therefore each 
of these systems will have eleven 84 x 84 WFSs and one 85 x 85 
actuator DM, leading to a total of about 125000 slope
measurements and 5800 active actuators. A control matrix for 
this system would have a size of nearly 3 GB (assuming el- 
ements are stored as 32 bit floating point). Generally, for 
a system of this size, processing power is not the limiting 
factor. Rather, it is the memory bandwidth required to load 
this control matrix from memory into the processing units 
for each frame (be they CPUs, GPUs, FPGAs etc.), which 
in this case is equal to about 725 GB/s (matrix size
multiplied by frame rate). The 3-GPU system that we have in 
Durham has a peak theoretical bandwidth of 444 GB/s, and 
our matrix-vector multiplication core is able to reach about 
70% of this. Theoretical, or even measured, matrix
multiplication rates are however not a good benchmark for a
real-time control system. This is because the RTCS will perform 
additional operations, which will affect cache or resource
usage, and furthermore, the multiplication will be broken up 
into blocks, allowing reconstruction to be interspersed with 
pixel readout. DARC is an ideal tool for such benchmarking,
as it is both a full RTCS and flexible enough to 
investigate parameters for optimal performance. 
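
The memory bandwidth argument above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only and is not part of DARC: the function and variable names are hypothetical, and the inputs are simply the figures quoted in the text (slope and actuator counts, 32 bit elements, a 250 Hz frame rate, and about 70% of a 444 GB/s peak).

```python
# Illustrative arithmetic only; names are hypothetical, values are from the text.
N_SLOPES = 125_000         # approximate slope measurements per correction arm
N_ACTUATORS = 5_800        # approximate active actuators per correction arm
BYTES_PER_ELEMENT = 4      # 32 bit floating point
FRAME_RATE_HZ = 250        # desired EAGLE update rate
PEAK_BANDWIDTH_GB_S = 444  # quoted peak for the three-GPU system
MVM_EFFICIENCY = 0.70      # fraction of peak reached by the MVM core


def control_matrix_bytes(n_slopes, n_actuators, bytes_per_element):
    """Size of a dense control matrix in bytes."""
    return n_slopes * n_actuators * bytes_per_element


def required_bandwidth_gb_s(matrix_bytes, frame_rate_hz):
    """Bandwidth needed to stream the whole matrix once per frame."""
    return matrix_bytes * frame_rate_hz / 1e9


matrix_bytes = control_matrix_bytes(N_SLOPES, N_ACTUATORS, BYTES_PER_ELEMENT)
required = required_bandwidth_gb_s(matrix_bytes, FRAME_RATE_HZ)
achievable = PEAK_BANDWIDTH_GB_S * MVM_EFFICIENCY

print(f"control matrix size : {matrix_bytes / 1e9:.1f} GB")
print(f"required bandwidth  : {required:.0f} GB/s")
print(f"achievable (3 GPUs) : {achievable:.0f} GB/s")
```

Under these assumptions the three-GPU system supplies well under half of the bandwidth needed for a full channel, which is why the reconstruction must be spread over more GPUs.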

By using the three Tesla GPUs that we have available, 
we are able to process one wavefront sensor on each GPU 
at a frame rate of about 300 Hz, with wavefront reconstruction
for one correction arm (i.e. for three of the eleven WFSs
of a single channel). We find that the frame rate falls
slightly with the number of WFSs processed (and hence the
number of GPUs used), as shown in Fig. 7. If we consider how
a system containing eight GPUs might behave (the maximum
number of GPUs that can be placed in a single PC), then an
extrapolation to eight WFSs and GPUs (which we realise is
rather dubious from the available data, but will suffice for
the argument being made here) might bring the frame rate down
to about 280 Hz. For a single channel of EAGLE, reconstruction
from eleven WFSs is required, which, if divided between
eight GPUs, would reduce the frame rate to about 200 Hz. If 
we were to change the GPUs used from Tesla 2070 cards to 
more powerful (yet cheaper) GeForce 580 cards (increasing 
the GPU internal memory bandwidth from 148 GB s^-1 to 
192 GB s^-1), the frame rate could be increased to 260 Hz. 

Figure 7. A figure showing how frame rate is affected by the 
number of wavefront sensors processed for an EAGLE-like system, 
assuming a GPU dedicated to each wavefront sensor all on the 
same computer. An exponential trend line has been fitted to the 
data. 
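
The scaling steps in the extrapolation above can be reproduced approximately as follows. This Python sketch rests on two simplifying assumptions that are ours rather than measured results: that frame rate scales inversely with the number of WFSs handled per GPU, and that it scales linearly with GPU memory bandwidth. The function names are hypothetical.

```python
# Illustrative scaling only; both scaling laws are simplifying assumptions.
RATE_8_WFS_8_GPU_HZ = 280.0  # extrapolated rate for eight WFSs on eight GPUs
N_WFS = 11                   # WFSs per EAGLE channel
N_GPU = 8                    # GPUs assumed per computer
TESLA_BW_GB_S = 148.0        # quoted Tesla 2070 memory bandwidth
GEFORCE_BW_GB_S = 192.0      # quoted GeForce 580 memory bandwidth


def scale_for_wfs_load(rate_hz, n_wfs, n_gpu):
    """Assume rate falls in proportion to the WFS load carried by each GPU."""
    return rate_hz * n_gpu / n_wfs


def scale_for_bandwidth(rate_hz, old_bw, new_bw):
    """Assume a memory-bound rate rises in proportion to memory bandwidth."""
    return rate_hz * new_bw / old_bw


rate_11_wfs = scale_for_wfs_load(RATE_8_WFS_8_GPU_HZ, N_WFS, N_GPU)
rate_geforce = scale_for_bandwidth(rate_11_wfs, TESLA_BW_GB_S, GEFORCE_BW_GB_S)

print(f"eleven WFSs on eight Tesla GPUs   : {rate_11_wfs:.0f} Hz")
print(f"eleven WFSs on eight GeForce GPUs : {rate_geforce:.0f} Hz")
```

The results land close to the rounded values of about 200 Hz and 260 Hz quoted in the text; the small differences come from rounding at each step.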

Since EAGLE channels are essentially independent, we 
could then replicate this system using a single computer and 
eight GPUs for each channel (and one more to control the 
E-ELT deformable M4 mirror with 85 x 85 actuators), giving 
a full real-time solution for EAGLE. In fact, this situation is 
even simpler, since we only need to perform wavefront slope 
measurement once per frame, rather than once per channel 
per frame, and so a front end (possibly FPGA based for min- 
imal latency, for example the SPARTA wavefront process- 
ing unit) could be used to compute wavefront slopes, which 
would greatly reduce the CPU processing power required for 
each channel (though GPU processing requirement would 
remain unchanged). 

It should be noted that in the case of wavefront
reconstruction only (assuming slope calculation is carried out 
elsewhere), our system in Durham is able to achieve 
a frame rate of 400 Hz, independent of whether we process 
slopes from one, two or three WFSs using a GPU for each. 
We are therefore confident that such a system would allow 
us to implement the entire EAGLE real-time control system
using currently available hardware and software. Given 
the performance improvements promised in both multi-core 
CPUs and in GPUs over the next few years, a CPU and 
GPU based solution for EAGLE will become even more feasible. 



4.4 Future real-time control requirements 

In addition to the consideration of real-time control for
EAGLE given in §4.3, we should also consider other future 
real-time requirements, whether DARC will be able to meet 
these, and what the AO community will require. 

The most demanding currently proposed instrument (in 
terms of computational requirement) is probably the Exo-Planet 
Imaging Camera and Spectrograph (EPICS) (Kasper et al.
2007) for the E-ELT. This XAO system consists of a WFS 
with 200 x 200 sub-apertures and a frame rate of 2 kHz.
Using DARC with currently available GPUs (NVIDIA GeForce 
580 with a memory bandwidth of 192 GB s^-1) performing 
matrix-vector multiplication based wavefront reconstruction 
would require a system with at least 150 GPUs. Although 
this is a similar number to that required by EAGLE, for 
EPICS the results from each GPU must be combined with 
results from all other GPUs, and to the authors, this does 
not seem to be a practical solution when a frame rate of 
2 kHz is required (for EAGLE, only computations from sets 
of eight GPUs need to be combined since the MOAO arms 
can be treated independently). On this scale, other
reconstruction algorithms do not seem appropriate: conjugate 
gradient algorithms cannot be massively parallelised to this 
degree, and, as far as we are aware, algorithms such as CuRe 
(Rosensteiner 2011) do not yet provide the correction
accuracy required. Therefore, we must conclude that DARC is 
not suitable for this application. However, given that this in- 
strument is at least 10 years away, more powerful hardware 
and more suitable algorithms are likely to become available. 

To our knowledge, other proposed instruments gener- 
ally have lower computational and memory bandwidth re- 
quirements than EAGLE, which we have shown that DARC 
would be capable of controlling. Therefore we are confident 
that DARC provides a real-time control solution for most 
proposed AO systems, and this demonstrates the suitability 
of CPU based real-time control systems for AO. It should 
be noted that it is only within the last few years, with the 
advent of multi-core CPUs and, more recently, GPU
acceleration, that such systems have become feasible, giving the 
advantages of both greater processing power and additional 
CPUs to handle non-real-time processes (such as operating 
system services and diagnostic systems), thus keeping jitter 
to a minimum. 



4.5 Wavefront sensor camera and deformable 
mirror specifications 

The timing measurements for DARC provided here are op- 
timistic since we do not include a physical wavefront sensor 
camera or DM. However, by considering the latency and 
frame-rate requirements for the systems for which we have 
investigated the performance of DARC, we can derive the 
specifications required for these hardware components that 
will allow the system to perform as desired. 

A frame computation time of 323 µs was measured for 
DARC operating in a VLT planet-finder configuration. The 
total latency for this system (including camera readout) 
must be below 1 ms. However, when camera readout time 
and DM settling time are taken into account, this corresponds 
to an acceptable real-time control system latency (from last 
pixel received to last DM command leaving) of about 100 µs 
(E. Fedrigo, private communication). To achieve this latency,
we can therefore specify that the camera pixel stream 
must be accessible in blocks that are equal to or less than 
a quarter of an image in size. The time to process each block 
will then be about 80 µs, meaning that it will take this long 
to finish computation once the last block arrives at the
computer, resulting in an 80 µs RTCS latency. 
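
The choice of four blocks follows from dividing the measured frame computation time by the latency target. A minimal Python sketch of that reasoning is given below; it assumes that processing time divides evenly between blocks and that all but the last block are fully hidden behind camera readout. The function names are hypothetical and are not part of DARC.

```python
import math

# Illustrative only; assumes equal-cost blocks fully interleaved with readout.
FRAME_PROC_US = 323.0      # measured frame computation time
TARGET_LATENCY_US = 100.0  # acceptable RTCS latency quoted in the text


def min_blocks_for_latency(frame_proc_us, target_latency_us):
    """Smallest block count whose per-block time does not exceed the target."""
    return math.ceil(frame_proc_us / target_latency_us)


def per_block_time_us(frame_proc_us, n_blocks):
    """Processing time for one block, which sets the RTCS latency."""
    return frame_proc_us / n_blocks


n_blocks = min_blocks_for_latency(FRAME_PROC_US, TARGET_LATENCY_US)
block_time = per_block_time_us(FRAME_PROC_US, n_blocks)

print(f"blocks needed    : {n_blocks}")
print(f"per-block latency: {block_time:.1f} us")
```

With three blocks the per-block time would exceed the 100 µs target, so a quarter of an image is the largest block size that satisfies it under these assumptions.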

The DARC configuration for a system resembling
Palm-3000 provides a processing time of 500 µs using three GPUs. 
Once a real WFS camera and DM are added, this processing 
time would increase, meaning that the 2 kHz frame rate could 
no longer be met. Therefore, a fourth GPU must be 
added to the system, which would result in a processing 
time of less than 400 µs. This provides a 100 µs 
overhead to allow for the process of sending commands to 
the DM and obtaining pixel data from a frame grabber card, 
which should be more than ample (typically this will just be 
a command to initiate a DMA), thus putting a performance 
requirement on the hardware. The maximum camera frame 
rate is 2019 Hz, corresponding to a readout time of 495 µs. 
This is greater than the processing time, meaning that we 
are able to interleave processing with readout, with 
enough time to finish block processing between each block 
of pixels arriving. The RTCS latency (last pixel received to 
last DM command sent) will therefore be determined by the 
pixel block size. If we read the camera in four blocks (giving 
a 100 µs processing time per block) then after the last pixel 
arrives, we will have a 100 µs processing time plus the 100 µs 
overhead that we have allowed for data transfer, a total of 
200 µs RTCS latency. For this system, we therefore require 
that the camera pixels can be accessed in four blocks, and 
that the time overhead for transferring pixels into memory 
and commanding the DM is less than 100 µs. 
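
The interleaving of readout and processing described above can be illustrated with a simple timeline model. The Python sketch below is not DARC code: it assumes that pixel blocks arrive at evenly spaced times during readout, that each block takes the same time to process, and that a block can only be processed once it has arrived and the previous block is finished. The function name and its parameters are hypothetical.

```python
# Illustrative pipeline model only; not part of DARC.
def rtcs_latency_us(readout_us, n_blocks, block_proc_us, transfer_overhead_us):
    """Time from last pixel received to last DM command sent."""
    block_readout_us = readout_us / n_blocks
    finish_us = 0.0
    for i in range(n_blocks):
        arrival_us = (i + 1) * block_readout_us
        # Start when the block has arrived and the previous block is done.
        finish_us = max(finish_us, arrival_us) + block_proc_us
    return finish_us - readout_us + transfer_overhead_us


# Figures quoted in the text for the Palm-3000-like configuration.
latency = rtcs_latency_us(
    readout_us=495.0,            # readout time at 2019 Hz
    n_blocks=4,                  # camera read in four blocks
    block_proc_us=100.0,         # 400 us processing over four blocks
    transfer_overhead_us=100.0,  # allowance for pixel and DM transfers
)
print(f"RTCS latency: {latency:.0f} us")

# If block processing were slower than block readout, work would back up.
backlogged = rtcs_latency_us(495.0, 4, 150.0, 100.0)
print(f"RTCS latency with 150 us blocks: {backlogged:.0f} us")
```

The second call shows why the readout time must exceed the total processing time: once a block takes longer to process than to read out, the backlog adds directly to the latency.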

Our configuration of DARC for an EAGLE-like
instrument gives a processing time of 3.8 ms. To achieve the
desired frame rate of 250 Hz, the total processing time must 
remain below 4 ms. We therefore have 200 µs in which to 
process the last block of pixels (processing of other blocks 
will be interleaved with pixel readout and thus will not
contribute to the RTCS latency), and to send commands to 
the deformable mirror. If we divide the pixel data for each 
camera into 42 blocks (which corresponds to two rows of
sub-apertures per block), then the processing time for each block 
is 90 µs, thus leaving 110 µs spare. We have therefore placed 
requirements on the DM and WFS camera for this system: 
we require a camera which will allow us to access the pixel 
stream in blocks of size 1/42 of a frame. We also require 
the time to receive this block of pixel data into computer 
memory, and to send DM demands from computer memory, 
to be less than 100 µs. This equates to a data bandwidth of 
about 4 Gbit s^-1, which is achievable with a single
PCI-Express (generation 2) lane. This requirement can easily be 
met given that camera interface cards typically use multiple 
lanes, and a DM interface card could also use multiple lanes. 
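
The timing budget for the EAGLE-like case can be laid out explicitly. The Python sketch below uses only the figures quoted in the text, together with the PCI-Express generation 2 signalling rate of 5 GT/s per lane and its 8b/10b encoding, which gives 4 Gbit/s of usable bandwidth per lane. The variable and function names are hypothetical, and the data volume computed at the end is derived from the quoted bandwidth and time window rather than from a stated camera format.

```python
# Illustrative budget only; names are hypothetical, inputs are from the text.
FRAME_RATE_HZ = 250
PROC_TIME_US = 3800.0  # measured processing time for the EAGLE-like case
N_BLOCKS = 42          # two rows of sub-apertures per block
TRANSFER_WINDOW_US = 100.0

frame_budget_us = 1e6 / FRAME_RATE_HZ        # time available per frame
slack_us = frame_budget_us - PROC_TIME_US    # left for last block and DM
block_time_us = PROC_TIME_US / N_BLOCKS      # processing time per block
spare_us = slack_us - block_time_us          # left after the last block

# PCI-Express generation 2: 5 GT/s per lane, 8b/10b encoded.
PCIE2_RAW_GT_S = 5.0
pcie2_lane_gbit_s = PCIE2_RAW_GT_S * 8 / 10


def bytes_movable(bandwidth_gbit_s, window_us):
    """Data volume that fits through a link within a time window."""
    return bandwidth_gbit_s * 1e9 * window_us * 1e-6 / 8


payload_bytes = bytes_movable(pcie2_lane_gbit_s, TRANSFER_WINDOW_US)

print(f"slack per frame    : {slack_us:.0f} us")
print(f"per-block time     : {block_time_us:.0f} us")
print(f"spare after block  : {spare_us:.0f} us")
print(f"one gen-2 lane     : {pcie2_lane_gbit_s:.0f} Gbit/s")
print(f"data in 100 us     : {payload_bytes / 1e3:.0f} kB")
```

The per-block and spare times agree with the rounded 90 µs and 110 µs quoted in the text, and a single lane moves roughly 50 kB within the 100 µs window.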



5 CONCLUSION 

We have presented details of the Durham AO real-time con- 
trol system, including recent improvements. We have de- 
scribed some of the more advanced features and algorithms 
available with DARC and given details about the improved 
modularity and flexibility of the system (including the
ability to load and unload modules while in operation). We have 
also discussed ways in which DARC can be optimised for 
specific AO systems. 

We have discussed how DARC can be used for almost 
all currently proposed future AO systems, and given
performance estimates for some of these. We are well aware that 
without a physical camera and DM, the results that we have 
presented here do not represent the whole system, and so we 
have used these performance estimates to derive the
wavefront sensor camera and DM specifications that are required 
to ensure an AO system can meet its design performance. 
The architecture of DARC, allowing interleaving of processing
and camera readout, means that AO system latency can 
be kept low, and so we are confident that DARC presents 
a real-time control solution that is well suited to ELT scale 
systems. 

By making use of GPU technology, we have been able to 
provide a real-time control system suited for all but the most 
ambitious of proposed ELT instruments, and have demon- 
strated that software and CPU based real-time control sys- 
tems have now come of age. 